Notes on tidymodels
tidyverse
lubridate
janitor
skimr::skim()
Hmisc::rcorr()
GGally::ggpairs() for pairplot
ggstatsplot::ggcorrmat()
jtools::export_summs() to see lm summary, plot_summs() to view coefficients
huxtable
interactions::interact_plot() to visualize interactions, spinoff from jtools
ggplot
ggthemes for fancy themes (theme_wsj, theme_tufte, theme_gdocs, theme_fivethirtyeight, theme_few)
ggalt: extra coordinate systems, statistical transformations
ggsci for jco style
ggthemr for predefined themes: fresh, greyscale, pale, e.g. ggthemr("fresh")
scales
ggfortify::autoplot() for linear model, pca, clustering
ggstance: horizontal geoms, a simpler alternative to coord_flip()
gridExtra, patchwork
tidymodels
broom, modelr
vip
doParallel, ranger, usemodels for RF modelling
xgboost for XGBoost
glmnet for logistic regression
kernlab for SVM
kknn for KNN
DT
plotly
As part of data cleaning process, assign cleaned dataset to a new variable and do not overwrite the original dataset
Is it skewed (numerical y)?
Is it imbalanced (categorical y)?
Recode y so that the CORRECT level is the reference level.
Remove duplicated rows, for example with dplyr::distinct()
step_rm() removes columns that should not be used in the model
Zero variance columns do not add useful information to the model.
step_zv()
step_nzv()
Remove.
Why are there missing values? Are they due to human error, or were they purposely omitted (missing not at random), for example, females not reporting their age in surveys?
This will affect the decision on whether to remove the column (with missing values), or the rows with missing values (but will also remove other information), or to impute the missing data.
step_impute_*(), for example step_impute_median(), step_impute_mode(), step_impute_knn()
step_naomit()
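The imputation option above can be sketched as follows. This is only a minimal illustration: the data frame `df_missing` and its columns are hypothetical toy data, and median/mode imputation is just one of several possible choices.

```r
library(recipes)

# Hypothetical toy data with missing values in a numeric and a nominal predictor
df_missing <- data.frame(
  y   = c(1.2, 3.4, 2.2, 5.1, 4.8, 3.9),
  age = c(23, NA, 31, 45, NA, 38),
  sex = factor(c("F", "M", NA, "F", "M", "M"))
)

# Count missing values per column before deciding what to do
colSums(is.na(df_missing))

recipe_impute <-
  recipe(y ~ ., data = df_missing) %>%
  step_impute_median(age) %>% # numeric: replace NA with the median of the training data
  step_impute_mode(sex)       # nominal: replace NA with the most common level

# prep() estimates the imputation values, bake(new_data = NULL) returns the processed training data
df_imputed <-
  recipe_impute %>%
  prep() %>%
  bake(new_data = NULL)

colSums(is.na(df_imputed))
```

Replacing the two imputation steps with step_naomit(all_predictors()) would drop the incomplete rows instead, at the cost of losing the other information in those rows.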
Check with skim() that the proper column types have been assigned.
Use summary() to check on statistics and skim() to check on distribution; ggpairs() can also be used.
This is to see if scaling, transformation steps are required.
step_YeoJohnson()
step_BoxCox()
step_log()
This is to check if there are any outliers, and also for summary statistics.
Create a boxplot function to loop over all numeric columns.
This is to check if any transformations are required, and whether the data is unimodal, bimodal or multimodal. May need to discretise in such cases.
Create a geom_histogram function to loop over all numeric columns.
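One possible way to write the boxplot and histogram helpers described above is to reshape the numeric columns to long format and facet, rather than looping explicitly. The function name `plot_numeric` and its `type` argument are hypothetical, and `iris` is used only as stand-in data.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical helper: one panel per numeric column, as boxplots or histograms
plot_numeric <- function(df, type = c("boxplot", "histogram")) {
  type <- match.arg(type)

  # Keep numeric columns only and stack them into variable/value pairs
  long <- df %>%
    select(where(is.numeric)) %>%
    pivot_longer(cols = everything(),
                 names_to = "variable",
                 values_to = "value")

  if (type == "boxplot") {
    p <- ggplot(long, aes(x = "", y = value)) +
      geom_boxplot()
  } else {
    p <- ggplot(long, aes(x = value)) +
      geom_histogram(bins = 30)
  }

  # Free scales so each variable is shown on its own range
  p +
    facet_wrap(~ variable, scales = "free") +
    labs(x = NULL)
}

p_box  <- plot_numeric(iris, type = "boxplot")
p_hist <- plot_numeric(iris, type = "histogram")
p_box
p_hist
```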
Does x vary linearly with y? Or are polynomial transformations required?
Visualize the correlation plot and the correlation matrix to see which variables are inter-correlated, and which variables correlate well with the y variable. The latter are likely to be important predictors.
If there are highly correlated variables, step_corr() can be used to remove them.
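As a rough sketch of the correlation filter, the example below uses the built-in `mtcars` data with `mpg` treated as the outcome purely for illustration. The 0.85 threshold is an assumed cut-off, not a recommendation.

```r
library(recipes)

# Correlation matrix of all columns, rounded for readability
round(cor(mtcars), 2)

recipe_corr <-
  recipe(mpg ~ ., data = mtcars) %>%
  step_corr(all_numeric_predictors(), threshold = 0.85) # assumed cut-off

prepped_corr <- prep(recipe_corr)

# Predictors that the filter decided to drop
removed <- tidy(prepped_corr, number = 1)$terms
removed

# Columns that remain after baking the training data
kept <- names(bake(prepped_corr, new_data = NULL))
kept
```

step_corr() only looks at correlations among the predictors, so the outcome column is never removed by it.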
Scale affects certain models, especially regularised regression (such as glmnet), distance-based models (KNN, SVM) and PCA.
step_normalize()
step_center()
step_scale()
Code it to be the first variable.
Are there too many categories?
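textrecipes::step_clean_names() to clean column names, and textrecipes::step_clean_levels() to clean the levels of a factor column
step_other() to collapse infrequent levels if there are too many categories.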
Change from factor to numerical using step_dummy
This requires domain knowledge.
step_interact() to create new predictor variables.
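In recipes, interaction terms are created with step_interact(). A minimal sketch follows, using `mtcars` as stand-in data; the choice of `wt` and `hp` as interacting predictors is arbitrary and not based on domain knowledge.

```r
library(recipes)

recipe_interact <-
  recipe(mpg ~ wt + hp, data = mtcars) %>%
  step_interact(terms = ~ wt:hp) # adds a new column holding wt * hp

baked_interact <-
  recipe_interact %>%
  prep() %>%
  bake(new_data = NULL)

# The new predictor is named with the default separator "_x_"
names(baked_interact)
```

If one of the interacting variables is categorical, it usually needs to be converted with step_dummy() first, and the interaction is then specified on the resulting dummy columns.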
Split the data first before preprocessing to avoid data leakage.
Take note of using strata for initial_split().
set.seed(220629)
df_split <-
df_cleaned %>%
initial_split(prop = 0.75,
strata = y_variable_column_if_imbalanced)
df_train <-
df_split %>%
training()
df_test <-
df_split %>%
testing()
Tidymodels Feature Engineering
You can have different recipes, with different sets of preprocessed data, for eg:
recipe_all_variables
recipe_with_interactions
recipe_for_rf (need not normalise variables)
recipe_for_lr (need to normalise)
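The idea of keeping several recipes side by side might look like the sketch below. `iris` stands in for the training set, and the specific steps are assumptions chosen only to show that one recipe can extend another.

```r
library(recipes)

# Tree-based models do not need normalised predictors
recipe_for_rf <-
  recipe(Species ~ ., data = iris) %>%
  step_zv(all_predictors())

# Same base steps, plus normalisation for models that are sensitive to scale
recipe_for_lr <-
  recipe_for_rf %>%
  step_normalize(all_numeric_predictors())

baked_lr <-
  recipe_for_lr %>%
  prep() %>%
  bake(new_data = NULL)

# After step_normalize(), each numeric predictor has mean 0 and sd 1 on the training data
summary(baked_lr)
```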
logistic_regression_model <-
logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
rf <-
rand_forest() %>%
set_args(trees = 1000,
mtry = tune()) %>%
# min_n = tune()) %>%
set_engine("ranger",
importance = "impurity") %>%
set_mode("classification")
xg_boost <-
boost_tree(trees = 1000,
mtry = tune(),
min_n = tune(),
tree_depth = tune(),
sample_size = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("classification")
workflow_rf <-
workflow() %>%
add_recipe(your_recipe) %>%
add_model(your_model_name)
# set up cv
set.seed(220628)
cv <-
df_train %>%
vfold_cv(10)
grid_rf <-
expand.grid(mtry = c(3,4,5,6))
set.seed(220627)
tuned_rf <-
workflow_rf %>% # workflow defined earlier
tune_grid(grid = grid_rf, # defined earlier
resamples = cv,
metrics = metric_set(#sens,
#spec,
#f_meas,
accuracy,
roc_auc))
parameters_tuned_RF <-
tuned_rf %>%
select_best(metric = "roc_auc")
finalized_workflow_rf <-
workflow_rf %>% # earlier workflow
finalize_workflow(parameters_tuned_RF) # with tuned best parameters
fit_RF <-
finalized_workflow_rf %>% # workflow with best parameters
last_fit(df_split) # fit on the training set, evaluate on the test set
performance_rf <-
fit_RF %>% # last fit
collect_metrics() %>%
mutate(model_name = "Model A (RF)")
Use bind_rows() to combine the performance tibbles of several models.
predictions_rf <-
fit_RF %>% # last fit
collect_predictions()
It is better to write your own function for creating the confusion matrix plot. Then create a tibble with prediction_results (df) and actual_y, and use map2 to attach a confusion matrix to each set of prediction_results.
predictions_rf %>%
select(.pred_class, your_actual_y_column) %>%
conf_mat(estimate = .pred_class,
truth = your_actual_y_column) %>%
pluck(1) %>%
as_tibble() %>%
mutate() %>% # create col names for TP, TN, FP, FN based on prediction_y == Yes and actual_y == Yes etc
ggplot() +
geom_tile() +
scale_fill_manual() +
geom_label() +
geom_text()
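The helper-function approach mentioned above could be sketched as follows. This version uses purrr::map() rather than map2, because the truth and prediction columns live in the same data frame. The helper name `get_conf_mat` is hypothetical, and the `two_class_example` data shipped with yardstick (columns `truth` and `predicted`) is reused for both "models" purely as stand-in predictions.

```r
library(dplyr)
library(purrr)
library(yardstick)

data("two_class_example", package = "yardstick")

# Hypothetical helper: confusion matrix for one set of predictions
get_conf_mat <- function(predictions) {
  conf_mat(predictions, truth = truth, estimate = predicted)
}

# One row per model, with the predictions stored in a list column
results <-
  tibble(
    model_name  = c("Model A", "Model B"),
    predictions = list(two_class_example, two_class_example) # same toy data twice
  ) %>%
  mutate(confusion = map(predictions, get_conf_mat))

# Counts for the first model as a tibble with columns Prediction, Truth, n
results$confusion[[1]] %>%
  pluck("table") %>%
  as_tibble()
```

The resulting tibble of counts can then be passed to ggplot() with geom_tile() and geom_text(), following the outline above.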
# combined predictions by different models into a tibble using bind_rows
combined_tibble %>%
group_by(your_models_column_name) %>%
roc_curve(your_actual_y_column_name,
your_predicted_y_column_name) %>%
autoplot()
Split the data first before preprocessing to avoid data leakage
set.seed(220629)
df_split <-
df_cleaned %>%
initial_split(prop = 0.75)
df_train <-
df_split %>%
training()
df_test <-
df_split %>%
testing()
Tidymodels Feature Engineering
You can have different recipes, with different sets of preprocessed data, for eg:
recipe_all_variables
recipe_with_interactions
recipe_for_rf (need not normalise variables)
recipe_for_lr (need to normalise)
Tidymodels Case Study: https://www.tidymodels.org/start/case-study/
ols <-
linear_reg() %>%
set_engine("lm")
rf <-
rand_forest() %>%
set_args(trees = 1000,
mtry = tune(),
min_n = tune()) %>%
set_engine("ranger",
importance = "permutation") %>%
set_mode("regression")
You can set up multiple workflows
workflow_ols <-
workflow() %>%
add_recipe(your_recipe_name) %>%
add_model(your_model_name)
set.seed(22063001)
cv <- your_training_set %>%
vfold_cv() # 10 folds by default
tuned_ols <-
workflow_ols %>%
tune_grid(resamples = cv)
parameters_tuned_ols <-
tuned_ols %>% # from above
select_best(metric = "rmse")
finalized_workflow_ols <-
workflow_ols %>% # workflow set up earlier
finalize_workflow(parameters_tuned_ols) # best hyperparameters
fit_ols <-
finalized_workflow_ols %>% # workflow with best hyperparameters
last_fit(df_split) # the initial_split() object: fits on training, evaluates on testing
Workflow sets allow holding multiple workflow objects, by crossing all combinations of preprocessors and model specifications. This set can then be tuned or resampled using a set of specific functions.
Have different recipes (recipe_base, recipe_filter_correlation)
Have different models (model_glm, model_knn)
Rather than creating 4 combinations of preprocessors and models, a workflow set can be created.
workflow_SETS <-
workflow_set(
preproc = list(simple = recipe_base,
filter = recipe_filter_correlation),
models = list(glmnet = model_glm,
knn = model_knn),
cross = TRUE
)
seed <- 20220706
CV <-
df_train %>%
vfold_cv(repeats = 10,
strata = y_variable_column_name)
# set up grid
Grid_control <-
control_grid(
save_pred = T,
save_workflow = T,
parallel_over = "everything"
)
tuned_grid <-
workflow_SETS %>%
workflow_map(
seed = seed,
resamples = CV,
control = Grid_control,
verbose = T,
grid = 10
)
# visualize
tuned_grid %>%
autoplot()
performance_ols <-
fit_ols %>% # last fitted model on testing dataset
collect_metrics() # rmse and rsq by default for regression
predictions_ols <-
fit_ols %>% # last fitted model on testing dataset
collect_predictions() # will show .pred, actual_y
If y was transformed, change it back to original form.
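As a small illustration of the back-transformation, suppose y was modelled on the log scale. The tibble `predictions_log`, its column `y_log` and the numbers in it are hypothetical placeholders, not results from any model.

```r
library(dplyr)

# Hypothetical predictions from a model fitted on log(y)
predictions_log <- tibble(
  .pred = log(c(10, 20, 30)), # predicted values on the log scale
  y_log = log(c(12, 18, 33))  # actual values on the log scale
)

# exp() undoes log(), bringing both columns back to the original units
predictions_original <-
  predictions_log %>%
  mutate(.pred_original = exp(.pred),
         y_original     = exp(y_log))

predictions_original
```

Other transformations need their own inverse, for example squaring after a square-root transformation.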
Find out variable importance.
Refine model if needed to remove certain variables.
Plot predicted_y and actual_y, and use interactive tools to show any identifiers (eg name, id)
Preprocess again if needed.
This is an iterative process!
saveRDS(your_fitted_model, "model_name.rds") # first argument is the object to save
save.image("project_name_date.Rdata")
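A quick round trip shows how a saved model can be restored later with readRDS(). Here a plain lm() fit on `mtcars` stands in for the fitted workflow, and the file is written to a temporary directory; both are assumptions made only to keep the sketch self-contained.

```r
# Stand-in for a fitted model or workflow
fit_example <- lm(mpg ~ wt, data = mtcars)

# Write to a temporary location so the example leaves no files behind
path_model <- file.path(tempdir(), "model_name.rds")
saveRDS(fit_example, path_model)

# Restore in a later session and use as before
fit_restored <- readRDS(path_model)
predict(fit_restored, newdata = data.frame(wt = 3))
```

saveRDS() stores a single object that can be assigned to any name on reading, whereas save.image() stores the whole workspace under the original object names.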
For attribution, please cite this work as
lruolin (2022, June 30). pRactice corner: Tidymodels notes. Retrieved from https://lruolin.github.io/myBlog/posts/20220630 - tidymodels notes/
BibTeX citation
@misc{lruolin2022tidymodels, author = {lruolin, }, title = {pRactice corner: Tidymodels notes}, url = {https://lruolin.github.io/myBlog/posts/20220630 - tidymodels notes/}, year = {2022} }